47 research outputs found

    On the overestimation of random forest’s out-of-bag error


    Resampling approaches in biometrical applications


    Prediction Models for Time Discrete Competing Risks

    The classical approach to modeling discrete time competing risks consists of fitting multinomial logit models whose parameters are estimated using maximum likelihood theory. Since the effects of covariates are specific to the target events, the resulting models contain a large number of parameters, even if there are only few predictor variables. Due to the large number of parameters, classical maximum likelihood estimates tend to deteriorate or may not even exist. Regularization techniques can be used to overcome these problems. This article explores the use of two different regularization techniques, namely penalized likelihood estimation and random forests, for modeling discrete time competing risks, using both extensive simulation studies and analyses of real data. The simulation results as well as the applications to three real-world data sets show that the novel approaches perform very well and distinctly outperform the classical (unpenalized) maximum likelihood approach.
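    The penalized multinomial logit idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the person-period data are simulated, the variable names are made up, and scikit-learn's ridge-penalized logistic regression stands in for the penalized likelihood methods the abstract refers to.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated person-period data: each row is one subject-interval.
# Response: 0 = no event in this interval, 1/2 = competing event types.
n = 2000
X = rng.normal(size=(n, 5))  # covariates (would include time dummies in practice)
logits = np.column_stack([np.zeros(n),          # reference: no event
                          0.8 * X[:, 0] - 1.5,  # event type 1 driven by x0
                          0.8 * X[:, 1] - 1.5]) # event type 2 driven by x1
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y = np.array([rng.choice(3, p=row) for row in p])

# Ridge-penalized multinomial logit; a smaller C means a stronger penalty,
# which stabilizes estimation when the model has many parameters.
model = LogisticRegression(penalty="l2", C=0.5, max_iter=1000).fit(X, y)
print(model.coef_.shape)  # one coefficient vector per event type
```

    The point of the penalty is visible in the parameter count: with event-specific effects, even 5 covariates yield 3 x 5 coefficients, and the count grows quickly with more covariates and time intervals.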


    Overview of Random Forest Methodology and Practical Guidance with Emphasis on Computational Biology and Bioinformatics

    The Random Forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables, and returns measures of variable importance. This paper synthesizes ten years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is given to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics, as well as some representative examples of RF applications in this context and possible directions for future research.
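    As a minimal illustration of the practical parameters the overview emphasizes (number of trees and number of candidate variables per split), the following sketch uses Python/scikit-learn rather than the R implementations the paper surveys; the data set is simulated and all settings are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Toy "p >> informative" setting: 500 variables, only 5 informative.
X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=5, random_state=0)

# The key tuning parameters from the RF literature, in sklearn terms:
#   n_estimators ~ ntree (number of trees),
#   max_features ~ mtry  (candidate variables per split; sqrt(p) is a
#                         common default for classification).
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)

print(round(rf.oob_score_, 2))   # out-of-bag accuracy estimate
print(rf.feature_importances_)   # Gini-based variable importances
```

    The out-of-bag score reuses, for each tree, the observations not drawn into its bootstrap sample, so no separate test set is needed for a first accuracy estimate.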

    Random Forests for Ordinal Response Data: Prediction and Variable Selection

    The random forest method is a commonly used tool for classification with high-dimensional data that is able to rank candidate predictors through its inbuilt variable importance measures (VIMs). It can be applied to various kinds of regression problems, including nominal, metric and survival response variables. While classification and regression problems using random forest methodology have been extensively investigated in the past, there is a lack of literature on handling ordinal regression problems, that is, problems in which the response categories have an inherent ordering. Breiman's classical random forest version ignores the ordering in the levels and fits standard classification trees; alternatively, if the response is treated as a metric variable, regression trees are used, which, however, are not appropriate for ordinal response data. Further compounding the difficulty, the currently existing VIMs for nominal or metric responses have not proven appropriate for ordinal responses. The random forest version of Hothorn et al. utilizes a permutation test framework that is applicable to problems where both predictors and response are measured on arbitrary scales, and is therefore a promising tool for handling ordinal regression problems. However, for this random forest version there is also no specific VIM for ordinal response variables, and the appropriateness of the error-rate-based VIM computed by default in the case of ordinal responses has to date not been investigated in the literature. We performed simulation studies using random forests based on conditional inference trees to explore whether incorporating the ordering information yields any improvement in prediction performance or variable selection. We present two novel permutation VIMs that are reasonable alternatives to the currently implemented VIM, which was developed for nominal responses and makes no use of the ordering in the levels of an ordinal response variable.
Results based on simulated and real data suggest that predictor rankings can be improved by using our new permutation VIMs, which explicitly use the ordering in the response levels, in combination with the ordinal regression trees suggested by Hothorn et al. With respect to prediction accuracy, in our studies the performance of ordinal regression trees was similar to, and in most settings even slightly better than, that of classification trees. An explanation for the better performance is that in ordinal regression trees there is a higher probability of selecting relevant variables for a split. The code implementing our studies and our novel permutation VIMs for the statistical software R is available at http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html
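    The idea of a permutation VIM that respects the ordering can be sketched as follows. This is only an illustration, not the authors' exact VIMs: it contrasts an ordinal-aware loss (mean absolute distance between ordered category indices) with the usual 0/1 error when computing permutation importance, using an ordinary classification forest on simulated data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Ordinal response with 4 ordered levels, driven by the first predictor.
n = 1000
X = rng.normal(size=(n, 6))
latent = X[:, 0] + 0.3 * rng.normal(size=n)
y = np.digitize(latent, bins=[-1.0, 0.0, 1.0])   # levels 0..3

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

def permutation_vim(rf, X, y, j, loss, n_rep=5):
    """Importance of predictor j: mean increase in loss after permuting it."""
    base = loss(y, rf.predict(X))
    deltas = []
    for _ in range(n_rep):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        deltas.append(loss(y, rf.predict(Xp)) - base)
    return float(np.mean(deltas))

err01 = lambda y, yhat: np.mean(y != yhat)         # nominal 0/1 loss
mad   = lambda y, yhat: np.mean(np.abs(y - yhat))  # ordinal rank distance

print(permutation_vim(rf, X, y, 0, mad))  # informative predictor
print(permutation_vim(rf, X, y, 5, mad))  # noise predictor
```

    The ordinal loss penalizes predicting level 3 for a true level 0 more heavily than predicting level 1, which the 0/1 loss cannot distinguish; this is the kind of ordering information the abstract argues a VIM should exploit.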

    A computationally fast variable importance test for random forests for high-dimensional data

    Random forests are a commonly used tool for classification with high-dimensional data as well as for ranking candidate predictors based on so-called variable importance measures. There are different importance measures for ranking predictor variables; the two most common are the Gini importance and the permutation importance. The latter has been found to be more reliable than the Gini importance. It is computed from the change in prediction accuracy when any association between the response and a predictor variable is removed, with large changes indicating that the predictor variable is important. A drawback of these variable importance measures is that there is no natural cutoff that can be used to discriminate between important and non-important variables. Several approaches, for example approaches based on hypothesis testing, have been developed to address this problem. The existing testing approaches are permutation-based and require the repeated computation of forests. While for low-dimensional settings these permutation-based approaches may be computationally tractable, for high-dimensional settings, typically including thousands of candidate predictors, the computing time is enormous. We propose a new, computationally fast heuristic variable importance test that is appropriate for high-dimensional data where many variables do not carry any information. The testing approach is based on a modified version of the permutation variable importance measure, which is inspired by cross-validation procedures. The novel testing approach is evaluated and compared to the permutation-based testing approach of Altmann and colleagues in studies on complex high-dimensional binary classification settings. The new approach controlled the type I error and had at least comparable power at a substantially smaller computation time in our studies. The new variable importance test is implemented in the R package vita.
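    The cutoff problem described above can be illustrated with a small sketch. This is only in the spirit of the proposed heuristic, not the vita implementation: hold-out permutation importances of noise variables scatter around zero, so an empirical null distribution can be built by mirroring the negative importances and taking an upper quantile as cutoff.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# Moderately high-dimensional setting: only the first 2 of 50 predictors matter.
n, p = 400, 50
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] + rng.normal(size=n) > 0).astype(int)

# Hold-out split so importances of noise variables scatter around zero.
Xtr, Xte, ytr, yte = X[:300], X[300:], y[:300], y[300:]
rf = RandomForestClassifier(n_estimators=300, random_state=2).fit(Xtr, ytr)

base = rf.score(Xte, yte)
vim = np.empty(p)
for j in range(p):
    Xp = Xte.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    vim[j] = base - rf.score(Xp, yte)   # accuracy drop after permutation

# Heuristic null: negative importances can only arise from noise, so mirror
# them around zero and use an upper quantile of the result as the cutoff.
neg = vim[vim < 0]
null = np.concatenate([neg, -neg, vim[vim == 0]])
cutoff = np.quantile(null, 0.99) if null.size else 0.0
flagged = np.where(vim > cutoff)[0]
print(flagged)
```

    Unlike the repeated-forest permutation tests the abstract criticizes, this kind of heuristic needs only a single fitted forest, which is where the computational saving comes from.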

    Categorical variables with many categories are preferentially selected in model selection procedures for multivariable regression models on bootstrap samples

    To perform model selection in the context of multivariable regression, automated variable selection procedures such as backward elimination are commonly employed. However, these procedures are known to be highly unstable. Their stability can be investigated using bootstrap-based procedures: the idea is to perform model selection successively on a large number of bootstrap samples and to examine the obtained models, for instance in terms of the inclusion of specific predictor variables. However, such bootstrap-based procedures are known from the literature to yield misleading results in some cases. In this paper we thoroughly investigate a particularly important facet of these problems. More precisely, we assess the behaviour of regression models, fitted with an automated variable selection procedure based on the likelihood ratio test, on bootstrap samples drawn with replacement and on subsamples drawn without replacement, with respect to the number and type of included predictor variables. Our study includes both extensive simulations and a real data example from the NHANES study. The results indicate that models derived from bootstrap samples include more predictor variables than models fitted on the original samples, and that categorical predictor variables with many categories are preferentially selected over categorical predictor variables with fewer categories and over metric predictor variables. We conclude that using bootstrap samples to select variables for multivariable regression models may lead to overly complex models with a preferential selection of categorical predictor variables with many categories. We suggest the use of subsamples instead of bootstrap samples to bypass these drawbacks.
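    The mechanism at the heart of this comparison is easy to demonstrate: a bootstrap sample of size n drawn with replacement contains on average only about 63.2% unique observations, the rest being duplicates, whereas a subsample drawn without replacement contains no duplicates at all. A minimal sketch (the subsample size of 0.632n used here is a common convention, not something prescribed by the abstract):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Bootstrap sample: draw n observations WITH replacement.
boot = rng.integers(0, n, size=n)
frac_unique_boot = np.unique(boot).size / n   # about 1 - 1/e = 0.632

# Subsample: draw 0.632*n observations WITHOUT replacement.
sub = rng.choice(n, size=int(0.632 * n), replace=False)
frac_unique_sub = np.unique(sub).size / n     # exactly 0.632, no duplicates

print(round(frac_unique_boot, 3), round(frac_unique_sub, 3))
```

    The duplicated observations effectively shrink the sample's information content while inflating apparent significance in likelihood ratio tests, which is one route by which bootstrap-based selection can favour overly complex models.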